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Human immunodeficiency virus (HIV) evolves with extraordinary rapidity. However, its evolution 
is constrained by interactions between mutations in its fitness landscape. Here we show that an 
Ising model describing these interactions, inferred from sequence data obtained prior to the use 
of antiretroviral drugs, can be used to identify clinically significant sites of resistance mutations. 
Successful predictions of the resistance sites indicate progress in the development of successful models 
of real viral evolution at the single residue level, and suggest that our approach may be applied to 
help design new therapies that are less prone to failure even where resistance data is not yet available. 


Under selective pressure from sub-optimal anti¬ 
retroviral treatment regimens, HIV has been ob¬ 
served to evolve drug resistance within weeks of 
treatment initiation [T|. While modern combina¬ 
tion therapies have greatly reduced the rate of evo¬ 
lution of drug resistance, resistant strains are found 
in greater than 14% of newly infected patients in 
the United States H OH- The rapid evolution of 
resistance is congruent with the overall observation 
that HIV evolution is remarkably fast, with studies 
indicating that in the absence of treatment a sin¬ 
gle patient’s HIV infection will explore every pos¬ 
sible point mutation many times daily @H5]. How¬ 
ever, empirical studies of viral sequence data indi¬ 
cate that HIV evolution is structured and exhibits 
reproducible patterns mm 

The existence of significant correlations in the evo¬ 
lution of HIV suggests that sequence data can be 
used to parameterize statistical mechanical models 
of HIV evolution that predict important features of 
its evolution, including the evolution of drug resis¬ 
tance. Previous researchers have used a variety of 
approaches to predict HIV fitness and aspects of its 
evolution using viral sequence data on its own Em 
and with additional phenotypic properties such as 
drug resistance and replicative capacity [5). Oth¬ 
ers have addressed the problem of predicting the 
sites of drug resistance mutations by detecting sites 
under positive selection during treatment [TO], su¬ 
pervised learning m, and structural modeling of 
protein-drug interactions nung. 

Here we use HIV sequence data, obtained prior to 
the widespread clinical use of protease inhibitors, to 


parameterize a spin representation of the standard 
Eigen model of quasi-species evolution [bTHTo] . We 
then use this model to predict sets of sites in HIV 
protease where joint mutations are unlikely to sig¬ 
nificantly impair viral fitness. We hypothesize that 
such sites are more likely to be sites of clinically rel¬ 
evant drug resistance mutations because resistance 
mutations that severely impair viral replication are 
unlikely to be selected. Our successful identifica¬ 
tion of major drug resistance sites (defined in [16| ) 
using natural evolution data suggests that our tech¬ 
niques could be applied to predict HIV evolution in 
response to new treatment regimens or vaccine can¬ 
didates when resistance information is unknown. 

We begin by inferring an estimate of the prob¬ 
ability distribution of mutations in the viral pro¬ 
tease from sequence data. Protease amino acid se¬ 
quences are first translated into a binary form, with 
the amino acid at each site i encoded by Si € {0,1}, 
where 0 (1) denotes a wild type (mutant) amino acid 
at that site. Full sequences are thus represented as 
vectors s = (si, S 2 ,sgg). We assume that the joint 
distribution of mutations is adequately captured by 
the moments (siSj) and find the maximum entropy 
distribution consistent with the observed moments 
(note that because s 2 = Sj, (s*) = (s 2 ) all first mo¬ 
ments are included) 0 :17]. The resulting probabil¬ 
ity distribution takes the form 

P(s) = Z _1 exp(— E(s)) 

l l m \ 

E(s) = JijSiSj + hiSi , 

i<j i=l 



FIG. 1. The coupling —J between a pair of sites in¬ 
creases sharply as the fitness of the double mutant ap¬ 
proaches the fitness of the wild type sequence. Top 
panel: The peak in — J occurs at the level crossing, where 
—jf = h{ + h{. If — becomes larger than h{ + h{, so 
that the double mutant has higher fitness than the wild 
type sequence (shaded region), the corresponding cou¬ 
pling — J in the prevalence landscape decreases. Bottom 
panel: Log fitness of the wild type strain and the dou¬ 
ble mutant strain as a function of J* . The log fitnesses 
intersect at the point where — J is maximized. 

where Z is the partition function. The parameters 
{ ,Jjj }, {hi} must be chosen such that the distribution 
P(s) reproduces the observed moments ( SiSj ). Here 
the {Jr,} can be thought of as capturing direct in¬ 
teractions between sites, disentangled from the net¬ 
work of correlations that include indirect effects me¬ 
diated through intermediate sites [18U20] . Similar 
maximum entropy approaches have been fruitfully 
applied to analyze patterns of neural activity and to 
predict contact residues in protein families nT)l - f22] . 
The description of the selective cluster expansion al¬ 
gorithm used to infer E(s) is given in [18) 23]. Al¬ 
though only the pair correlations are constrained in 
Eq. |Tj the inferred Ising model accurately predicts 
higher order correlations as well. 

The form of the probability distribution gives 
rise to the notion of a “prevalence landscape” that 
expresses the relative frequencies of protease se¬ 
quences. Previous work has shown that the inferred 
prevalences of sequences from HIV Gag proteins 
correlate with their replicative capacities, another 
proxy for fitness ® m, in line with the intuition 
that fitter strains should be more prevalent. How¬ 
ever, prevalence is affected by many factors other 
than fitness, including epidemiological dynamics, re¬ 
combination, and demographic noise, which compli¬ 


cate this association [2oIf2T| . 

Insight into the relation between fitness and preva¬ 
lence can be obtained through Eigen’s model of evo¬ 
lution ini. This model assumes an infinite popula¬ 
tion of viruses, and accounts for mutation and selec¬ 
tion, but neglects many of the important effects de¬ 
scribed above. However, these simplifications allow 
for the relationship between fitness and prevalence 
to be studied using methods adopted from statisti¬ 
cal physics ii- Following Eigen’s model, the 
prevalence can also be written as the outcome of 
evolutionary dynamics over a large number of gen¬ 
erations T, represented as a series of coupled Ising 
spin systems [13] 

exp(— E(s t )) oc 

"T-l 

E exp ^A-(2 S ‘-l)(2 S t+1 -l)- J P( S t )) 
{»*}£? Lt=i 

L L 

p{ s ) = y ' j(j s i s 3 j - y ' Si, 

i<j i=l 

( 2 ) 

where K is related to the per site per generation 
mutation rate /i by A' = |log(^^). Here F(s) 
is minus the log fitness of sequence s and will be 
referred to as the fitness landscape. The super¬ 
scripts t € {1,2... ,T} on the sequence vectors re¬ 
fer to discrete generations. The superscript / on 
the parameters {h{} and {j/j } indicates that these 
parameters are taken from the fitness landscape in 
Eq. [2] (assumed to have the same functional form 
as Eq. [I]), rather than the prevalence landscape of 
Eq. [T] The evolutionary dynamics described here 
applies to evolution within a population of hosts. 
Equations describing within host evolution would 
require accounting for differing immune pressure be¬ 
tween individuals m , though protease is not com¬ 
paratively immunogenic 1281 . 

Ideally, one would like to invert Eq. [2] to solve for 
F(s) in terms of E(s), because E(s) is inferred di¬ 
rectly from data. In principle, this could be achieved 
by matching the distribution of sequences at the final 
generation T in the Ising representation of Eigen’s 
model with the prevalence landscape, given by Eq. [T| 
This is a challenging problem in general, however, 
approximate results can be obtained by studying a 
two site system, which can be solved by straightfor¬ 
ward transfer matrix methods. While network ef¬ 
fects influence the inferred couplings between sites, 
this simple approximation provides useful intuition. 
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Furthermore, network effects exert a weaker influ¬ 
ence on the ( Si ), as most of their variance is ex¬ 
plained by the single site hi in Eq. |Tj 

Solving the two site version of Eq. [2] shows that 
the h* are difficult to reliably infer, because the mu¬ 
tation coupling K is large enough (K ~ — log(/x) 
and n — £>(10 -4 )) that very small lead to large h 
in the prevalence landscape. However, large values 
of J in the prevalence landscape have a simple inter¬ 
pretation in the fitness landscape as couplings be¬ 
tween pairs of sites where mutating both sites leads 
to only a small change in fitness compared to wild 
type (Fig. 0 . In this case the double mutant could 
become advantageous with only a small increase in 
the fitness of one of the mutations, as might occur 
when drugs are added to the environment, for exam¬ 
ple. Mathematically, this occurs as — Jf approaches 
h{ + ^ 2 - We refer to the point in parameter space 
where the coupling between sites allows the double 
mutant strain to have equal fitness to the wild type 
as a level crossing. 

To go from the interpretation of large values of — J 
in the prevalence landscape as indicators of nearby 
level crossings to predictions of resistance muta¬ 
tions requires elucidating a relationship between 
level crossings and resistance mutations. A rigor¬ 
ous argument relating resistance mutations to the 
fitness landscape would require detailed knowledge 
of the drug, its binding sites, the structure of the tar¬ 
get protein, and other details. However, generically 
we expect that when the environment in which HIV 
replicates changes due to the initiation of drug ther¬ 
apy, HIV must mutate in ways that abrogate drug 
binding, while at the same time preserving protein 
function. Large couplings — J connect sites that are 
likely to be able to co-mutate with limited costs to 
fitness, even if the associated individual mutations 
are costly. Such sets of sites are therefore more likely 
to be associated with resistance. Here our assump¬ 
tion is that resistance cannot be achieved through 
selectively neutral mutations at single sites, in which 
case drug treatment would likely be ineffective. 

To predict the sites of resistance mutations 
based on the above considerations, we consider the 
strongest couplings —Jij associated with each site i. 
Using the largest coupling values we then assign each 
site a rank r £ { 1,..., 99} from strongest to weakest. 
We predict that the sites with the strongest interac¬ 
tions (i.e. the highest ranked sites) are most likely 
to be associated with drug resistance. Focusing on 
the highest ranked sites, and the strong couplings 



FIG. 2. Stronger couplings are more likely to link sites of 
major resistance mutations. Here we show the network of 
interactions between the top r ranked sites, from r = 40 
(left) to r = 10 (right). Only the strongest couplings, 
those meeting or exceeding the largest coupling for the 
lowest ranked site, are displayed. Interactions linking at 
least one major resistance site are darkly shaded, links 
between non-resistance sites are lightly shaded. 


between them, can be seen as a process of pruning 
weaker interactions from the network. Three pruned 
versions of the network of mutational interactions in 
HIV protease are shown in Fig. [2] 

This model can be cast in the form of a classifica¬ 
tion rule by predicting sites ranked at or above some 
threshold rank r to be sites of drug resistance mu¬ 
tations, and sites of lower rank to be unassociated 
with resistance. To test the model’s performance, 
we take the set of resistance sites to be those clas¬ 
sified as sites of major resistance mutations by the 
Stanford HIV drug resistance database (sites 30, 32, 
33, 46, 47, 48, 50, 54, 76, 82, 84, 88, and 90) US]. 
As higher ranked sites are selected, the proportion 
of sites that are associated with resistance should 
increase. This can be measured using positive pre¬ 
diction value (PPV) and negative prediction value 
(NPV), defined as 

P(true = resistance | predicted = resistance), 

P(true = non-resistance | predicted = non-resistance). 

These are shown in Fig. [3] compared to benchmarks 
for a random classifier, and demonstrate that the 
performance of the classification rule is substantially 
better than chance for higher ranked sites. Exami¬ 
nation of the true positive rate (TPR) and false pos¬ 
itive rate (FPR), 

P(predicted = resistance|true = resistance), 

P(predicted = resistance|true = non-resistance), 

shown in Fig. [3] confirm that TPR>FPR, indicating 
performance better than chance. We also note that 
the fraction of the strongest interactions which link 
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FIG. 3. Top-ranked sites, based on the maximum 
strength of their couplings, are far more likely to be sites 
of major drug resistance mutations than would be ex¬ 
pected by chance, (a) Positive prediction value (PPV) 
and negative prediction value (NPV) for the classifier 
compared to the benchmark of random guessing, as a 
function of rank. Collections of the highest ranked sites 
are clearly associated with improved PPV. (b) False pos¬ 
itive rate (FPR) and true positive rate (TPR) as func¬ 
tions of rank. TPR > FPR indicates performance better 
than chance. 


at least one major drug resistance site is extremely 
high, as can be seen in Fig. [2] (further details in 
Supplemental Material). 

To examine these results using classical statisti¬ 
cal significance testing, we used the hyper-geometric 
distribution to compute p-values for the null model 
of randomly selecting the number of sites at or above 
each rank threshold and obtaining at least as many 
resistance mutations as found using the ranking clas¬ 
sifier. The predictions have p-values < 0.05 for es¬ 
sentially all rank thresholds from r = 3 — 50, which 
comports with the argument that strongly coupled 
sites are more likely to be sites of resistance mu¬ 
tations and supports the significance of the predic¬ 
tions of resistance among higher ranked sites. The 
lack of significance for the highest ranked pair is a 
consequence of the very small number of sites. We 
also tested related classification rules constructed us¬ 
ing direct information m and correlation matrices, 
with no improvement in performance (Supplemental 
Material). 

As virological failure occurs in patients undergo¬ 
ing treatment with protease inhibitors, new protease 
inhibitor drugs are administered [2]. To further 
assess the validity of our predictions, we used the 
model to infer pairs of protease inhibitors that are 
optimized to protect patients from evolving multi¬ 
drug resistance. To protect against resistance, a 
pair of drugs should have as many non-overlapping 
resistance mutations as possible. Additionally, the 


drugs’ associated resistance mutations should be dif¬ 
ficult to make simultaneously due to fitness con¬ 
straints. In the same way that large positive val¬ 
ues of — J indicate sites that can readily mutate to¬ 
gether, negative values of —J indicate sites where 
double mutations are suppressed. Thus, the inter¬ 
actions between the resistance mutations that are 
not common to both drugs should be as negative as 
possible. We found three combinations (atazanivir- 
indinavir, atazanavir-fosamprenavir, and darunavir- 
nelfanavir) that are optimal for both of these crite¬ 
ria in the Pareto sense: improvement in one crite¬ 
rion necessitates a reduction in the other criterion. 
Two of these, along with both near-optimal pairs 
(atazanavir-darunavir and atazanavir-lopinavir), in¬ 
corporate atazanavir, consistent with clinical knowl¬ 
edge that the resistance profile of atazanavir appears 
distinct from other protease inhibitors |2!)i . 

The network of large interactions also captures im¬ 
portant biophysical information. As a first example, 
the third strongest coupling is between sites 82 and 
54. Site 82 is frequently the first resistance muta¬ 
tion site observed after the initiation of protease in¬ 
hibitor treatment, and is usually followed by muta¬ 
tion at site 54 jTj. Some couplings may also be asso¬ 
ciated with stabilizing mutations, which compensate 
for loss of fitness due to a destabilizing mutation. A 
recent biophysical study examined the melting tem¬ 
peratures of HIV protease with a major resistance 
mutation at site 84 [3D]. The study showed that on 
its own, the major resistance mutation reduced the 
stability of HIV protease considerably. When the 
mutation at site 84 is accompanied by one of a set 
of three known accessory mutations at sites 10, 63, 
and 71, stability is restored, or even enhanced. Cou¬ 
plings between sites 10 and 84, and sites 63 and 84, 
are strong, in the top 7% of all couplings (though 
weaker than the couplings shown in Fig. [2j which 
are within the top 1%). The coupling between sites 
71 and 84 is slightly weaker, but still in the top 13% 
of all couplings. This suggests that links between 
destabilizing mutations and those that improve pro¬ 
tein stability may be captured by the network of 
interactions inferred from sequence data. 

Our results show that from sequence information 
alone, much of the evolutionary response of HIV pro¬ 
tease to inhibitors can be reproduced. While in the 
case of protease inhibitors, the answer was known, 
the successful retrodictions indicate that our under¬ 
standing of HIV evolution is becoming predictive at 
the level of individual residue sites. We anticipate 
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that the methods developed above will contribute to 
the development of predictive theories of viral evo¬ 
lution and to the development of new treatments, 
such as integrase inhibitors [31] , where resistance is 
not nearly as well characterized as in protease. 
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